Hospital LOS (Length-of-Stay)

First of all, what is LOS? Hospital length-of-stay (LOS) is defined as the time between hospital admission and discharge, measured in days.

1. Problem Statement

The goal is to create a model that predicts the length-of-stay for each patient at time of admission.

In order to predict hospital LOS, the MIMIC data first needed to be extracted, explored, and engineered into a suitable set of features.

2. Type of model used for prediction

Since LOS is a continuous rather than a categorical variable (it is measured in days), a regression model will be used for prediction.

3. Metrics used for validation

The expected outcome is that our model will predict hospital LOS better than the industry standards of median and average LOS. The median LOS is simply the median LOS of past admissions to a hospital; similarly, a second commonly used metric in healthcare is the average, or mean, LOS.

So, to measure performance of our model, we'll compare the prediction model against the median and average LOS using the root-mean-square error (RMSE). The RMSE is a commonly used measure of the differences between values predicted by a model and the values observed, where a lower score implies better accuracy. For example, a perfect prediction model would have an RMSE of 0.

The RMSE equation for this work is given as follows, where n is the number of hospital admission records, ŷ_i the predicted LOS, and y_i the actual LOS.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

We could say we have a successful model if its prediction results in a lower RMSE than the average or median models.
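The comparison above can be sketched in a few lines of Python. This is a minimal illustration with made-up LOS values, not the actual MIMIC data: the constant mean and median "models" predict the same value for every admission, and RMSE is computed directly from the definition above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted LOS."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Toy actual LOS values in days; the real values come from the ADMISSIONS table.
actual = np.array([2.0, 5.0, 7.0, 10.0, 3.0])

# Baseline "models": predict the same constant (mean or median) for everyone.
mean_baseline = np.full_like(actual, actual.mean())
median_baseline = np.full_like(actual, np.median(actual))

print(rmse(actual, mean_baseline))    # RMSE of the constant-mean model
print(rmse(actual, median_baseline))  # RMSE of the constant-median model
print(rmse(actual, actual))           # a perfect model scores 0
```

A prediction model beats these baselines when its RMSE is lower than both constants.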

There is a multitude of regression models available for predicting LOS. To determine the best among the subset of models evaluated, the R2 (R-squared) score will be used.

R-squared measures how much of the variability in the dependent variable can be explained by the model. In other words, it is the proportion of the variance in the dependent variable that is predictable from the independent variables. R2 is defined by the following equation, where (y_i) is an observed data point, (ȳ) is the mean of the observed data, and (f_i) is the corresponding model prediction.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - f_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

The best possible R2 score is 1.0, and a negative value means the model performs worse than a constant model (the average or median, in this case).
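The definition above can be checked numerically. Below is a small sketch with invented observed/predicted values, computing R2 both by hand and with scikit-learn's `r2_score`; note that a model predicting the constant mean scores exactly 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy observed LOS values and hypothetical model predictions.
y = np.array([2.0, 5.0, 7.0, 10.0])
f = np.array([2.5, 4.5, 7.5, 9.5])

# R^2 from the definition: 1 - SS_res / SS_tot.
ss_res = np.sum((y - f) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y, f))                          # scikit-learn agrees
print(r2_score(y, np.full_like(y, y.mean())))  # a constant-mean model scores 0
```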

4. Data extraction, exploration and feature engineering

ADMISSIONS table exploration and feature engineering

Length of stays computation

The LOS is not explicitly stored as an attribute in the ADMISSIONS table, so we have to calculate it. As stated above, LOS is defined as the time between admission and discharge from the hospital.

We noticed that the mean LOS is 10 days, but we also noticed that the minimum calculated LOS is negative. How can a LOS be negative? Let's look at the records associated with negative LOS values:
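A minimal sketch of this step, using a toy stand-in for the ADMISSIONS table (the ADMITTIME/DISCHTIME column names are the ones MIMIC uses; the values here are invented): compute LOS from the timestamps, inspect the inconsistent rows, and drop them.

```python
import pandas as pd

# Toy stand-in for the MIMIC ADMISSIONS table; the real data is loaded from CSV.
admissions = pd.DataFrame({
    'HADM_ID': [100, 101, 102],
    'ADMITTIME': pd.to_datetime(['2130-01-01 10:00', '2131-03-05 08:00', '2132-06-01 12:00']),
    'DISCHTIME': pd.to_datetime(['2130-01-11 10:00', '2131-03-04 20:00', '2132-06-05 12:00']),
})

# LOS in days = discharge time minus admission time.
admissions['LOS'] = (admissions['DISCHTIME'] - admissions['ADMITTIME']).dt.total_seconds() / (24 * 3600)

# Inspect, then drop the inconsistent records with a negative LOS.
print(admissions[admissions['LOS'] < 0])
admissions = admissions[admissions['LOS'] >= 0]
print(admissions['LOS'].min())  # no longer negative
```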

After dropping those records, the minimum LOS value is no longer negative. For a more informative view of the distribution of LOS values, we plot them:

Another thing to consider is admissions of patients who died at the hospital. Admissions resulting in death will be excluded, as they would bias the LOS: stays in this group tend to be shorter.
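In MIMIC, the DEATHTIME column of ADMISSIONS is populated when the patient died in hospital, so one way to sketch this filter is to keep only rows where DEATHTIME is null (again on a toy table):

```python
import pandas as pd

# Toy admissions; DEATHTIME is set only when the patient died in hospital.
admissions = pd.DataFrame({
    'HADM_ID': [100, 101, 102],
    'LOS': [10.0, 2.0, 4.0],
    'DEATHTIME': pd.to_datetime(['NaT', '2131-03-06 20:00', 'NaT']),
})

# Keep only admissions where the patient survived to discharge.
admissions = admissions[admissions['DEATHTIME'].isnull()]
print(len(admissions))
```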

We also said that we'll use the mean and median LOS for comparison, to gauge the accuracy of our model. So let's compute these LOS metrics, which we'll use later for model evaluation.

Reduction number of categories of ETHNICITY, ADMISSION_TYPE, INSURANCE

We notice that there are many ETHNICITY categories, but most of them are sub-categories, so we can reduce their number by keeping only the main category.
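One way to sketch this reduction (the splitting rule here is an assumption — MIMIC sub-categories typically look like `ASIAN - CHINESE` or `BLACK/AFRICAN AMERICAN`, so we keep the token before the separator):

```python
import pandas as pd

# Toy sample of MIMIC ETHNICITY values; the real table has many sub-categories.
eth = pd.Series([
    'WHITE', 'ASIAN - CHINESE', 'BLACK/AFRICAN AMERICAN',
    'HISPANIC/LATINO - PUERTO RICAN', 'ASIAN - VIETNAMESE', 'UNKNOWN/NOT SPECIFIED',
])

def main_category(value):
    """Collapse a sub-category to its main category (token before ' - ' or '/')."""
    for sep in (' - ', '/'):
        value = value.split(sep)[0]
    return value

reduced = eth.map(main_category)
print(reduced.unique())
```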

It's interesting to notice that ASIAN patients have the lowest median LOS, even though they are fewer in number than the other ETHNICITY categories.

Now let's perform the same analysis done for ETHNICITY on ADMISSION_TYPE and INSURANCE, reducing the number of categories where necessary.

The number of categories in ADMISSION_TYPE is not that high, but the URGENT category contains far fewer entries than the others. In addition, it is semantically very similar to EMERGENCY, so we can merge the two and treat URGENT as EMERGENCY.
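This merge is a one-liner with pandas `replace` (toy values shown):

```python
import pandas as pd

adm_type = pd.Series(['EMERGENCY', 'URGENT', 'ELECTIVE', 'NEWBORN', 'URGENT'])

# Fold the small, semantically similar URGENT category into EMERGENCY.
adm_type = adm_type.replace('URGENT', 'EMERGENCY')
print(adm_type.value_counts())
```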

As we might expect, newborns have the lowest median LOS, followed by elective admissions. This makes sense, since elective admissions are planned in advance and their risks are understood, in contrast to the EMERGENCY ADMISSION_TYPE.

Finally, let's do the same for the INSURANCE feature.

Analyzing this attribute, we see that the categories are fairly distinct. Notably, 'Self Pay' patients have the lowest median LOS; being self-pay often means a patient cannot (or did not) pay, which may shorten their stay at the hospital.

PATIENTS table exploration and feature engineering

The PATIENTS table contains DOB (date of birth) but not the patient's age, so to look at the age distribution we have to compute an AGE attribute. We compute it as the difference between a patient's date of birth (DOB) and the date of their first admission, ignoring subsequent admissions in this calculation.

To do this, let's first merge the PATIENTS table with the ADMISSIONS table explored previously.

Age calculation
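A minimal sketch of the age computation on toy merged rows (one row per admission; the SUBJECT_ID, DOB, and ADMITTIME column names are MIMIC's, the dates are invented, and the year-difference is a deliberate approximation):

```python
import pandas as pd

# Toy merged PATIENTS + ADMISSIONS rows (one row per admission).
df = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2],
    'DOB': pd.to_datetime(['2050-04-01', '2050-04-01', '2070-01-15']),
    'ADMITTIME': pd.to_datetime(['2130-01-01', '2135-06-01', '2132-06-01']),
})

# Age at FIRST admission: take the earliest ADMITTIME per patient,
# so subsequent admissions are ignored in the calculation.
df['FIRST_ADMIT'] = df.groupby('SUBJECT_ID')['ADMITTIME'].transform('min')
df['AGE'] = df['FIRST_ADMIT'].dt.year - df['DOB'].dt.year  # approximate age in years
print(df[['SUBJECT_ID', 'AGE']].drop_duplicates())
```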

As the age distribution shows, there are no patients in the childhood age range; this reflects the fact that MIMIC-III does not contain data from pediatric patients.

Now let's see how LOS, our prediction target, correlates with the age of the patients.

The plot highlights the MIMIC groups of newborns and >89-year-olds, and the number of admissions increases steadily from age 20 toward 80. Because of the discrete-like distribution at the extremes of age, it could be useful to convert all ages into the categories newborn, young adult, middle adult, and senior for use in the prediction model.
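This conversion can be sketched with `pd.cut`; note the cutoff years below are illustrative assumptions, since the text doesn't specify the exact bin boundaries:

```python
import pandas as pd

ages = pd.Series([0, 25, 45, 70, 91])

# Bin continuous age into the four categories used by the model.
# The boundaries (1, 40, 65) are hypothetical cutoffs for illustration.
bins = [-1, 1, 40, 65, 200]
labels = ['newborn', 'young adult', 'middle adult', 'senior']
age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```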

Finally, before moving on to the next table, let's look at the distribution of gender among patients in relation to LOS.

DIAGNOSES_ICD table exploration and feature engineering

The International Classification of Diseases, Clinical Modification (ICD-9-CM) is an adaptation created by the U.S. National Center for Health Statistics (NCHS) and used in assigning diagnostic and procedure codes associated with inpatient, outpatient, and physician office utilization in the United States (from Wikipedia: https://en.wikipedia.org/wiki/International_Classification_of_Diseases#ICD-9 ).

Because it's not feasible to use 6,984 unique values as features for predicting LOS, it is necessary to reduce the diagnoses to more general categories. Researching the ICD-9 scheme (from Wikipedia: https://en.wikipedia.org/wiki/List_of_ICD-9_codes), we see that the codes are arranged into supercategories as follows:

[Figure: ICD-9 code ranges grouped into chapter-level supercategories]

As we can see, we only need to consider the first three numeric characters of each code. Our task now is to recode each ICD-9 code to its supercategory.
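A sketch of the recoding function is below. The range boundaries follow the standard ICD-9 chapter list; the short category names are illustrative choices (the document later refers to groups such as pregnancy, skin, prenatal, respiratory, and nervous), and E/V codes are handled separately since they are not purely numeric:

```python
def icd9_supercategory(code):
    """Map an ICD-9 code string to its chapter-level supercategory,
    keyed on the first three numeric characters."""
    if code.startswith('V'):
        return 'supplementary'
    if code.startswith('E'):
        return 'external causes'
    num = int(code[:3])
    ranges = [
        (139, 'infectious'), (239, 'neoplasms'), (279, 'endocrine'),
        (289, 'blood'), (319, 'mental'), (389, 'nervous'),
        (459, 'circulatory'), (519, 'respiratory'), (579, 'digestive'),
        (629, 'genitourinary'), (679, 'pregnancy'), (709, 'skin'),
        (739, 'muscular'), (759, 'congenital'), (779, 'prenatal'),
        (799, 'misc'), (999, 'injury'),
    ]
    for upper, name in ranges:
        if num <= upper:
            return name
    return 'unknown'

print(icd9_supercategory('4019'))   # 401 falls in 390-459 -> circulatory
print(icd9_supercategory('V3000'))  # V code -> supplementary
```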

For each admission there is usually more than one diagnosis, and often more than one diagnosis falls into the same category.

We can create a matrix that shows all the diagnoses for each admission. This should be keyed on the admission (HADM_ID) rather than on SUBJECT_ID, since each patient can have different diagnoses for each admission.
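One way to sketch this matrix is with `pd.crosstab`, which gives one row per admission and one column per supercategory (toy recoded rows; HADM_ID is MIMIC's admission identifier):

```python
import pandas as pd

# Toy diagnosis rows after recoding to supercategories: one row per diagnosis.
diag = pd.DataFrame({
    'HADM_ID': [100, 100, 100, 101],
    'SUPERCAT': ['circulatory', 'respiratory', 'circulatory', 'pregnancy'],
})

# One row per admission, one column per supercategory, counts as values.
matrix = pd.crosstab(diag['HADM_ID'], diag['SUPERCAT'])
print(matrix)
```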

Looking at the median LOS for each ICD-9 supercategory shows an impressive spread between pregnancy and skin diagnosis code groups.

ICUSTAYS table exploration and feature engineering

These statistics show that, as far as LOS is concerned, a substantial difference in the median appears only between NICU and the other categories, whose medians are all very similar. We can therefore simply reduce to two groups: NICU and ICU (which includes all the others).
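This reduction is a one-liner with `Series.where` (toy care-unit values shown):

```python
import pandas as pd

icu = pd.Series(['MICU', 'NICU', 'SICU', 'CCU', 'NICU'])

# Keep NICU as its own group; collapse everything else into a generic ICU group.
icu_reduced = icu.where(icu == 'NICU', 'ICU')
print(icu_reduced.value_counts())
```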

5. Data cleaning

The final DataFrame size resulted in 37 feature columns and 1 target column (LOS) with an entry count of 53,104.

6. Prediction Model

What is supervised learning and why do we choose it? Supervised learning trains a model on examples whose target value is already known; here, past admissions are labeled with their actual LOS, so the model can learn a mapping from admission features to LOS and apply it to new admissions.

We will implement the supervised learning prediction model using the Scikit-Learn machine learning library.

To implement the prediction model, our dataset is split into training and test sets at an 80:20 ratio using the scikit-learn train_test_split function.
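The split looks like this (random placeholder data stands in for the engineered feature matrix; `random_state` is fixed only to make the example reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and LOS target standing in for the real dataset.
X = np.random.rand(100, 5)
y = np.random.rand(100) * 10

# 80:20 split, matching the ratio used in this work.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```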

Why split into training and test sets? The model is fit only on the training set, and its performance is measured on the held-out test set. This estimates how well the model generalizes to admissions it has never seen, rather than rewarding memorization of the training data.

Using the training set, we'll fit five different regression models (from the scikit-learn library) with default settings to see how their R2 scores compare.
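A sketch of this comparison is below. The original doesn't list which five models were tried, so the selection here is an assumption (common scikit-learn regressors), and synthetic data stands in for the MIMIC features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic linear-ish data standing in for the engineered MIMIC features.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.randn(300) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'DecisionTree': DecisionTreeRegressor(random_state=0),
    'RandomForest': RandomForestRegressor(random_state=0),
    'GradientBoosting': GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out test set
    print(f'{name}: {scores[name]:.3f}')
```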

The GradientBoostingRegressor has the best R2 score (~0.37), so we focus on refining this particular model.

7. Parameter Tuning

To refine the GradientBoostingRegressor model, the GridSearchCV function from scikit-learn is used to test various permutations of parameters such as n_estimators, max_depth, and loss.

What is GridSearchCV and how does it work? GridSearchCV exhaustively tries every combination of the supplied parameter values, scores each combination with cross-validation on the training data, and keeps the best-scoring estimator.

The best estimator found by GridSearchCV used n_estimators=200, max_depth=4, and loss='ls'.

Diagnoses related to prenatal issues have the highest feature-importance coefficient, followed by respiratory and nervous diagnoses. So one takeaway is that the ICD-9 diagnosis categories are by far the most important features.
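Feature importances come straight from the fitted model's `feature_importances_` attribute. A sketch on synthetic data, with hypothetical feature names echoing the categories above:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data; the feature names are illustrative placeholders.
X, y = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=0)
feature_names = ['prenatal', 'respiratory', 'nervous', 'age']

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; higher means more influence.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```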

In the metrics section above, we said that RMSE would be used to compare the prediction model against the industry-standard average and median LOS metrics. The gradient boosting model's RMSE is better by more than 24% (percent difference) than the constant average or median models (as the graphic below shows).

Another way to evaluate the model is to plot the proportion of accurate predictions on the test set against an allowed margin of error. Other studies count a LOS prediction as correct if it falls within a certain margin of error. Naturally, as the allowed margin of error increases, so does the proportion of accurate predictions for all models. The gradient boosting model outperforms the constant models across the whole margin-of-error range up to 50%.
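This "accuracy within a margin" metric can be sketched as follows (toy actual/predicted values; the relative-error definition is one reasonable reading of "margin of error"):

```python
import numpy as np

actual = np.array([2.0, 5.0, 7.0, 10.0, 3.0])
predicted = np.array([2.5, 4.0, 8.0, 9.0, 3.1])

def within_margin(y_true, y_pred, margin):
    """Proportion of predictions within +/- `margin` (as a fraction) of the true LOS."""
    return np.mean(np.abs(y_pred - y_true) <= margin * y_true)

# The proportion of "correct" predictions grows as the allowed margin widens.
for margin in (0.1, 0.25, 0.5):
    print(margin, within_margin(actual, predicted, margin))
```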

Conclusions for LOS (Length-of-stay)

Hospital stays are expensive for the health system: U.S. hospitals, for example, account for $377.5 billion per year in spending, and recent Medicare legislation standardizes payments for procedures performed, regardless of the number of days a patient spends in the hospital.

This incentivizes hospitals to identify patients at high risk of a long LOS at the time of admission. Once identified, these patients can have their treatment plan optimized to minimize LOS and lower the chance of acquiring a hospital-acquired condition. Another benefit is that prior knowledge of LOS can aid in logistics such as room and bed allocation planning.